The data was sent to you before the course.
Please place this file on your desktop and unzip it.
We are going to be working on 2 files. Please open:
1. Beginners_R_Basics_Practicals.Rmd
2. Beginners_R_Basics.html
Beginners_R_Basics.html is an HTML file, and should open in your default internet browser.
Beginners_R_Basics_Practicals.Rmd is an R Markdown file, and should have opened in RStudio.
We need to alter how the output is displayed for Beginners_R_Basics_Practicals.Rmd in RStudio.
We want the output to be displayed in the console NOT within the R markdown file.
To change this:
Click on the Gear Icon > Select "Chunk Output in Console"
Beginners_R_Basics_Practicals.Rmd has a table of contents created using the # which creates headers in Markdown.
To access this: press the dashed line button on the upper right corner of the markdown file.
You should ALWAYS write your code into a script that you save.
This saves you from re-writing code that you have already written, saving yourself time and energy.
It also helps make all the analyses you produce reproducible, which is essential for all scientific research.
It is worth noting that some journals have started to ask for code submission alongside papers to make the data and results within publications as transparent as possible. Get into the habit now of writing and saving scripts.
Let’s open an R script now
Open a new R script
Go to File > New File > R script
The practicals (Beginners_R_Basics_Practicals.Rmd) are written using R Markdown, which we will talk about in detail later.
Look at Beginners_R_Basics_Practicals.Rmd
For the moment the simplest way to do this is to open a new Markdown file which will prompt you to add the required packages if necessary.
To open a new Markdown file:
Go to File > New File > R Markdown...
If this prompts you to install packages, click yes
This will open a dialogue window, click CANCEL.
The most basic use of R is getting it to perform mathematical operations.
Click the green arrow on the right hand side to run this code in the practical markdown.
2 + 2
5 - 3 + 2 * 4^2
11465 * 2358971436
options(scipen = 999)
11465 * 2358971436
What is the difference between these last two runs?
What do you think options(scipen = 999) is doing?
Some examples of mathematical operations:
| Action | Symbol |
|---|---|
| Addition | + |
| Subtraction | - |
| Multiplication | * |
| Division | / |
| Exponentiation | ^ |
| Modulo (remainder of the Euclidean division) | %% |
| Trig functions | sin(), cos(), tan(), acos(), asin(), atan() |
| Logarithms | log() (natural log), log10(), log2() |
| Exponential | exp() |
| Absolute Value | abs() |
| Square Root | sqrt() |
| Rounding | ceiling(), floor(), round() |
See here for more basic math functions built into R.
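A few of the operators and functions from the table can be tried out as follows. This is only a minimal sketch; the numbers and variable names are arbitrary and chosen for illustration.

```r
# Modulo gives the remainder after division
remainder <- 17 %% 5                      # 2

# log() is the natural log; log10() and log2() use base 10 and base 2
natural_log <- log(exp(1))                # 1
log_base_10 <- log10(1000)                # 3
log_base_2  <- log2(8)                    # 3

# Absolute value and square root
absolute_value <- abs(-4)                 # 4
square_root    <- sqrt(16)                # 4

# Three ways of rounding 2.567
rounded_up   <- ceiling(2.567)            # 3
rounded_down <- floor(2.567)              # 2
rounded_2dp  <- round(2.567, digits = 2)  # 2.57
```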
Do some mathematical operations in R
Complete these in Beginners_R_Basics_Practicals.Rmd
The prompt is represented by this symbol >. This means that R is awaiting input, and tells you that nothing is currently running.
When a line of code is incomplete (for example, a bracket has not been closed), R will “hang”, and this symbol + will be at the start of the line.
This means R is waiting for the command to be completed.
When this happens, the command will sometimes need to be “killed” in order to get the input prompt back.
To get back to the input prompt (>) for R, hit ESC
If you do not see > or +, this implies that R is running the previous command submitted.
Click the green arrow to run this code in Beginners_R_Basics_Practicals.Rmd.
5 - 3 + 2 * 4^
Click the green arrow on the right hand side to run this code in Beginners_R_Basics_Practicals.Rmd.
(4 * (5 - (3 + 1))
A variable stores a value or an object.
Use the variable’s name to easily access the value or the object that is stored within it.
x <- 43
y <- "hello"
Variable names cannot contain spaces or special characters (periods and underscores are allowed). They also cannot start with a number.
<- is used to assign values to variables.
The symbol is a less-than sign followed by a dash: <-
Good style practice : Space out your code!
# No spaces
x<-4+5/6*4
# With spaces
x <- 4 + 5 / 6 * 4
It makes it easier to read/digest when it is spaced out.
Here are two R style guides which cover things like file names, variable names, assignments, spacing, syntax, etc. One from Google and one from Hadley Wickham.
You can also use = for assignment, and in most cases it behaves the same way as <-.
However, there is a difference in their behaviour which is to do with scoping and environments in R.
It is better to use <- for assignment to avoid any potential future issues.
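One place where the two symbols behave differently is inside a function call. The sketch below uses mean() purely as an illustration; the helper names x_after_equals and x_after_arrow are made up for this example.

```r
# Start from a clean slate so the checks below are meaningful
if (exists("x", inherits = FALSE)) rm(x)

# Inside a function call, = only names the argument x of mean()
mean(x = c(2, 4, 6))
x_after_equals <- exists("x", inherits = FALSE)   # FALSE: no variable was created

# <- inside a function call assigns first, then passes the value on
mean(x <- c(2, 4, 6))
x_after_arrow <- exists("x", inherits = FALSE)    # TRUE: x now holds 2, 4, 6
```

Outside of function calls the two usually give the same result, which is why sticking to <- for assignment keeps the intent unambiguous.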
Best practice:
There are a variety of different types of variable naming conventions:
this_is_called_snake_case
whileThisIsCamelCase
sometimes.there.are.periods
Or_thereAre.thoseThat_Annoy.everyone
For best compatibility with other programs (like Python), don’t use sometimes.there.are.periods as this can interfere with Python’s dot notation.
Assign values to variables
Create a list of different animals (barnyard and household animals) each with a different value. Create at least 7 variables.
Use variable names instead of values to perform mathematical operations, or any operation in R. R will replace the variable name with the value it represents.
farmyard <- horses + ...
farmyard
household <- dogs + ...
household
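As a rough illustration of the pattern, here is a sketch with a handful of animals. The animal names and counts are invented for this example; your own variables will differ.

```r
# Hypothetical counts, invented for this sketch
horses <- 3
cows <- 12
chickens <- 20
dogs <- 2
cats <- 4

# R replaces each name with its value before adding
farmyard <- horses + cows + chickens   # 35
household <- dogs + cats               # 6

# Variables can be mixed with plain numbers in any calculation
animals_per_person <- (farmyard + household) / 2   # 20.5
```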
Perform other mathematical operations with variables created
There are two ways to do this:
dogs
print(dogs)
Do not use print if you have a lot of data.
This will print all the rows and columns in the dataset to the screen, which is not useful if you have a few thousand rows or columns.
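For larger objects, one common alternative is to look at only a small part of the data. The sketch below is a suggestion rather than part of the practical; it uses head() and str() on iris, a data set that ships with R.

```r
# iris is a data set that ships with R (150 rows, 5 columns)
dim(iris)

# Show only the first 3 rows instead of printing everything
first_rows <- head(iris, n = 3)
first_rows

# str() gives a compact summary of each column
str(iris)
```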
RStudio will tab complete previously defined variable names and also commands!
This can be very useful, particularly for long or complicated variable names which are harder to get wrong if R is helping you complete the name.
Remember with computers: You MUST ALWAYS be completely precise in your instructions
Not all data are the same. R treats data differently depending on what type of data it is.
Types of data in R:
| Type | Example |
|---|---|
| Character | “words”, “strings of words”, “a”, “abc” |
| Numeric (real or decimal) | 1, 23.25 |
| Integer | 2L |
| Logical | TRUE, FALSE |
| Complex | 1+4i |
The L after the number tells R that this is an integer. Integer is a subclass of numeric. Numbers written without the L are stored as type double, which is also a subclass of numeric. Complex is for complex numbers with real and imaginary parts.
To determine what type of data something is you can use the class() or typeof() function. These give you slightly different information.
Variables inherit their class from the value or object assigned to that variable.
x <- 3
y <- "cars"
class(x)
## [1] "numeric"
class(y)
## [1] "character"
typeof(x)
## [1] "double"
typeof(y)
## [1] "character"
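The same two functions can be tried on the other types from the table. This is a small sketch; the variable names are made up for illustration.

```r
whole_number <- 2L
decimal_number <- 23.25
flag <- TRUE
complex_number <- 1 + 4i

class(whole_number)      # "integer"
typeof(whole_number)     # "integer"
class(decimal_number)    # "numeric"
typeof(decimal_number)   # "double"
class(flag)              # "logical"
class(complex_number)    # "complex"

# Both integers and doubles count as numeric
is.numeric(whole_number)     # TRUE
is.numeric(decimal_number)   # TRUE
```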
There are various ways to get help in R.
When you know the exact name of the function : ?function_name or help(function_name)
When you don’t completely remember the function name : ??function_name
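As a sketch of how this looks in practice, the lines below open a help page and then inspect a function's arguments directly in the console. The use of args() and formals() is an addition here, not something the course materials rely on.

```r
# Open the help page for mean(); these two lines are equivalent.
# They are wrapped in interactive() so the script also runs unattended.
if (interactive()) {
  ?mean
  help("mean")
}

# args() prints a function's arguments and their defaults in the console
args(read.table)

# formals() returns the defaults as a list that can be inspected
read_table_defaults <- formals(read.table)
read_table_defaults$header   # FALSE
read_table_defaults$sep      # ""
```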
Let’s see how to get help for the mean function in R (mean()).
Go to the help tab in RStudio in the bottom right quadrant, and type the function name into the search box
Use ?
Also Google it. These are the results from googling R mean help
THE INTERNET IS YOUR FRIEND
Quick-R
Cookbook for R
R-bloggers
Stackoverflow
CrossValidated
Google
Querying the internet can be more useful as people give examples. Usually someone has asked your question previously, and someone has already answered!
Examine the help file for read.table
R is open source and easily expandable. Therefore a lot of people have contributed packages to R over the years. These are functions people have written to perform a wide variety of tasks.
See here for a long list of available packages.
There is also a biology specific repository called BioConductor.
You can also contribute to R in the same way, by writing a package. We will not be covering this in this course, but there is an entire book about it available here.
It is easy to install packages, and can be done via R. There are two general steps:
Install the package
This means downloading the required code from the web repository. Packages may require other packages to work, so more than one package may be downloaded when installing a single package.
Load the package into R
Once the package is downloaded, it then needs to be loaded into the R environment for use. Downloading the package does not make the commands available; it must be loaded into R first.
The easiest way to find most packages is via Google.
# Download package and install it on your computer
# Only has to be done once
install.packages("packagename")
# Load package into R for use
# Must be done in every script that uses the package
library(packagename)
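If a script is shared with others, one possible pattern is to install a package only when it is missing. The sketch below uses "stats", which ships with R, as a stand-in for a real package name; package_name is a made-up variable for this example.

```r
# "stats" ships with R, so it stands in for a real package name here
package_name <- "stats"

# Install only if the package is missing, so the download happens once
if (!requireNamespace(package_name, quietly = TRUE)) {
  install.packages(package_name)
}

# character.only = TRUE lets library() take the name from a variable
library(package_name, character.only = TRUE)
```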
Install and load the library for tidyverse package
Other subject specific repositories exist, like Bioconductor.
Installing packages from Bioconductor has a different specific syntax, which is:
# Download package and install it on your computer from Bioconductor
# Only has to be done once
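# biocLite() has been replaced by the BiocManager package in current Bioconductor releases
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("packagename")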
# Load package into R for use
# Must be done in every script that uses the package
library(packagename)
Once the package is installed, you can load the library using the normal syntax.
Get and load the metaCCA package from Bioconductor
When prompted, say no (type n) for updating all packages
You only have to install the package once (not once per script, just once), but you have to load the package every single time you want to use it (once per script).
A few other useful commands:
installed.packages() to see what packages are installed
update.packages() to update installed packages
# List every package installed
installed.packages()
# Update (if there is an update) all installed packages
update.packages()
Most R packages come with a reference manual and vignettes. Depending on who wrote the R packages, vignettes can be extremely useful examples of how to use the package.
Let’s look at the vignettes for dplyr
Again, NOT ALL vignettes are this well formatted. Because packages can be contributed to R by anyone, the quality varies widely from package to package.
Good project management will ultimately make your life easier:
Best Practice for project management:
These are some tips for creating a folder structure for different projects. This is flexible; work with a structure that best suits your working style.
The important point is to have an organized structure.
Do not use spaces, quotes, special characters, or wildcard characters such as * or ? in filenames, as it complicates variable expansion.
Give files consistent names that make logical sense, reflect what the data is, and are easy to match with wildcard patterns so that they can be selected easily for looping.
Supplement R code with narration so that in the future it is easier to understand why decisions were made and what the thought process was.
It also allows for integration of UNIX (Bash), Python, SQL, and more.
Make publishable reports in various formats (HTML, PDF, Word, Slideshows).
The practical has been written in Markdown using R.
The .Rmd file can be opened in RStudio, to work on the practical or add in additional text/information as notes.
See here for reference guide, and here for a cheat sheet.
Examine Markdown HTML file and Rmd to compare
Open Beginners_R_Basics.Rmd and compare it to Beginners_R_Basics.html
To open a new Markdown file:
Go to File > New File > R Markdown...
This will open a dialogue window, select Document if you are making a Markdown script and give it a title.
You can select a default output format, but you can always change the output format afterwards.
This will open a new markdown file.
We will be working within R markdown files for the course. The course materials are written in R markdown files, and the practicals will be done within R markdown.
R markdown uses a simple mark up language.
Headers are created using the # symbol.
# Header 1
## Header 2
### Header 3
On a Mac, the # is created by pressing ALT + 3. If you use a keyboard from another language, you might have to google its location.
Stars or underscores can be used to make words italic or bold. A single star or underscore on either side of a word makes it italic, double on either side makes it bold.
*italics* **bold**
_italics_ __bold__
Inserting links and pictures can be done using:
# Link
[click here](http://google.com)
# Picture
![alt text](path/to/image.png)


Lists can be either ordered or unordered:
# Unordered List
* Stuff
* some more stuff
    + sub stuff
    + second sub stuff
# Ordered List
1. first
2. second
3. third
    + third part one
    + third part two
To get a new line, you must finish a line with two spaces. A hard return will not create a new line.
See here for more basics.
You can examine Beginners_R_Basics.Rmd for more examples of how to create tables, links, pictures, etc. using markdown language. It will show you how this HTML was created.
Code chunks can be modified to alter the output.
| Rule | Example (default) | Function |
|---|---|---|
| eval | eval=TRUE | Is the code run and the results included in the output? |
| include | include=TRUE | Are the code and the results included in the output? |
| echo | echo=TRUE | Is the code displayed alongside the results? |
| warning | warning=TRUE | Are warning messages displayed? |
| error | error=FALSE | Are error messages displayed? |
| message | message=TRUE | Are messages displayed? |
| tidy | tidy=FALSE | Is the code reformatted to make it look “tidy”? |
| results | results=“markup” | How are results treated? “hide” = no results “asis” = results without formatting |
| cache | cache=FALSE | Are the results cached for future renders? |
| comment | comment=“##” | What character are comments prefaced with? |
| fig.width, fig.height | fig.width=7 | What width/height (in inches) are the plots? |
| fig.align | fig.align=“left” | “left” “right” “center” |
Table taken from here
Altering a markdown file
x <- 75
y <- 238
z <- -21
(x * z) - y
Knit the HTML
x <- 75
y <- 238
z <- -21
(x * z) - y
x <- 75
y <- 238
z <- -21
(x * z) - y
There are lots of ways to customize a markdown file.
A simple one is to alter the YAML code at the top to include a table of contents in the output.
This would look like this:
---
title: "Beginners_R_Basics"
author: "Katherine Tansey"
date: "20 September 2017"
output:
  html_document:
    toc_depth: 4
    toc: yes
    toc_float: yes
---
The above YAML does:
We have not specified anything for Word or PDF outputs, just for HTML, so these options will only affect the formatting of the HTML output.
For more options and customizing, see here, here, and here.
Two things are needed for importing data into R:
Below is a rough example of your computer’s file system, with the blue outlined boxes being directories (folders). The filled in blue squares are files (like an R script or Word document).
Both Macs and Windows have a root, the very top of the tree.
On Macs the root is a forward slash (/).
On Windows the root is the C drive (or D drive depending on your computer’s set up). In Windows Explorer this looks like C:\, but in R you need to change the direction of the slashes and write C:/.
If we want to set Beginners_R directory as our working directory in R, the path would be:
NOTICE : Windows users, the direction of the slash is NOT the same as in Windows Explorer!!!
There are two ways to do this:
Set a working directory
getwd()
# MACS
setwd("/Users/username/Desktop/BeginnersR_Materials_Day1/")
# Windows
setwd("c:/Users/username/Desktop/BeginnersR_Materials_Day1")
Why would you want to both setwd() and use a full/absolute path name to a file?
Set working directory
If you are not in BeginnersR_Materials_Day1 (this should be the last name in the path) when you run getwd(), then you need to change your working directory using setwd().
REMEMBER you are setting the working directory (folder) not pathing to a specific file.
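The steps of checking, changing and restoring the working directory can be sketched as below. To keep the example safe to run anywhere, tempdir() (a scratch folder R creates for each session) stands in for a real folder; starting_dir and data_path are made-up names.

```r
# Remember where we started
starting_dir <- getwd()

# tempdir() stands in for a real folder such as BeginnersR_Materials_Day1
setwd(tempdir())
getwd()

# file.path() joins folder and file names with forward slashes,
# which R accepts on both Mac and Windows
data_path <- file.path("data", "pheno.txt")   # "data/pheno.txt"

# Go back to where we started
setwd(starting_dir)
```

Building paths with file.path() avoids having to remember which way the slashes go.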
Depending on what format the data is in (text tab delimited (.txt), comma separated file (.csv), SPSS format (.sav)), a different R function is used.
Every data import function has multiple different options (called arguments).
Arguments change how a file is read in, for example whether there is a header line or not.
The one argument that is always needed is the filename of the data you want to load into R.
Text tab delimited file (file extension .txt)
\t is the computer symbol for a tab. The default separator for read.table is any white space (spaces or tabs), so if you want to split on tabs only you must tell it explicitly.
phenotype_data <- read.table("pheno.txt", sep = "\t")
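To see what an argument such as header changes, here is a small self-contained sketch. It writes a tiny made-up tab-delimited file to a temporary location and reads it back twice; the file contents and the names example_file, no_header and with_header are invented for illustration.

```r
# Write a tiny tab-delimited file to a temporary location (made-up data)
example_file <- tempfile(fileext = ".txt")
writeLines(c("id\tage\tgroup",
             "1\t34\tcase",
             "2\t51\tcontrol"),
           example_file)

# Without header = TRUE the column names are read as a row of data
no_header <- read.table(example_file, sep = "\t")
dim(no_header)        # 3 rows, 3 columns

# header = TRUE uses the first line as column names
with_header <- read.table(example_file, sep = "\t", header = TRUE)
dim(with_header)      # 2 rows, 3 columns
names(with_header)    # "id" "age" "group"
```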
Comma separated file (file extension .csv)
phenotype_data <- read.csv("pheno.csv")
Excel file (file extension .xlsx)
R can only import a single sheet from an Excel workbook at a time; the default is the first sheet. If you want to import any other sheet from the workbook, set the argument sheet to the sheet number you want.
install.packages("readxl")
library(readxl)
phenotype_data <- read_excel("pheno.xlsx")
phenotype_data2 <- read_excel("pheno.xlsx", sheet = 2)
Stata format (file extension .dta)
install.packages("haven")
library(haven)
phenotype_data <- read_dta("pheno.dta")
SPSS format (file extension .sav)
install.packages("haven")
library(haven)
phenotype_data <- read_sav("pheno.sav")
There are A LOT of arguments when it comes to reading data in like:
Where would you go to see what the default arguments are for a command?
Getting data into R
You have been given 5 data sets to load into R. All datasets are located in the data folder within BeginnersR_Materials_Day1.
We will load 2 of them using RStudio’s import functions, and then try to load 3 of them using the command line.
Older versions of RStudio:
File > Import Dataset > From CSV...
Newer versions of RStudio:
File > Import Dataset > From Text (base)...
Or from the Environment panel:
Click on the data folder
Select the correct file from the list (mart_export.txt)
The data preview should look like this:
Under Import Options (bottom left) change Name to de_list
Under Import Options (bottom left) change Skip to 2
The data preview should look like this:
Does the data look correct? Open up the newly made dataframe and have a look.
The data should look like this:
Does the data look correct? Open up the newly made dataframe and have a look by clicking on the object’s name in the Environment panel.
The data should look like this:
Again, two things are needed for exporting data out of R:
Depending on what format you want the data in (text tab delimited (.txt) or comma separated file (.csv)), a different R function is used.
Every data export function has multiple different options. These can change how a file is exported, for example whether there is a header line or whether you want row names.
mydata is the name of the dataframe object you want to export.
Text tab delimited file (extension .txt)
write.table(mydata, "mydata.txt", sep="\t")
Comma separated file (extension .csv)
write.csv(mydata, "mydata.csv")
File name is ALWAYS surrounded by quotes.
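As an illustration of how export options change the file, the sketch below writes a small made-up data frame to a temporary file and then shows the raw lines. The choice of row.names = FALSE and quote = FALSE is one reasonable setup, not the only one.

```r
# A small made-up data frame standing in for mydata
mydata <- data.frame(id = c(1, 2, 3),
                     group = c("case", "control", "case"))

# Temporary file so nothing is written into the working directory
out_file <- tempfile(fileext = ".txt")

# row.names = FALSE drops the 1, 2, 3 row labels;
# quote = FALSE drops the quotation marks around text values
write.table(mydata, out_file, sep = "\t", row.names = FALSE, quote = FALSE)

# Look at the raw lines that were written
written_lines <- readLines(out_file)
written_lines
```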
Useful options to usually set:
Get data out of R
Export the three loaded data sets out of R. Unfortunately this has to be scripted.
First use getwd() to see where you are on your computer
Do you want to export the data here? If not, what do you need to do?
There are a variety of different data structures in R. We are only going to cover three of them:
1. Vector
2. Data Frame
3. Factor
For more information on the rest of them, see here, here and here
Vectors are collections of values all of the same type (meaning characters, numeric, logical) in a one dimensional array.
c() is used to combine values into a vector.
character_vector <- c("Harry Potter", "Ron Weasley", "Hermione Granger", "Neville Longbottom")
numeric_vector <- c(1,2,3,4,5,6)
logical_vector <- c(TRUE, FALSE, FALSE, TRUE)
Subsetting, or extracting, individual elements from a vector occurs by using square bracket notation [ ] and the number of the elements in the vector that you want.
R counts from 1
Indexing vectors
character_vector <- c("Harry Potter", "Ron Weasley", "Hermione Granger", "Neville Longbottom")
character_vector[1]
character_vector[3]
character_vector[1:3]
character_vector[c(1,3)]
Complex indexing of vectors
Vectors can be indexed in a more complex fashion.
money_earned <- c(36,125,76,251,22)
Find out which values are greater than 100.
money_earned > 100
Extract just the days where money earned is greater than 100 and save those values as a new vector called good_days
good_days <- money_earned[money_earned > 100]
good_days
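The logical vector behind this kind of subsetting can be used in other ways too. The sketch below reuses money_earned; which(), sum() on a logical vector, and combining conditions with & are additions for illustration.

```r
money_earned <- c(36, 125, 76, 251, 22)

# The comparison gives one TRUE/FALSE per element
over_100 <- money_earned > 100     # FALSE TRUE FALSE TRUE FALSE

# which() turns the TRUEs into positions
which(over_100)                    # 2 4

# TRUE counts as 1, so sum() counts how many days passed the threshold
sum(over_100)                      # 2

# Conditions can be combined with & (and) or | (or)
middle_days <- money_earned[money_earned > 30 & money_earned < 100]   # 36 76
```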
%in%
%in% corresponds to the match function
top10 <- c("Noah", "Liam", "Mason", "Jacob", "William", "Ethan", "James", "Alexander", "Michael", "Benjamin")
Mstart <- c("Mason", "Michael")
M_names <- top10[top10 %in% Mstart]
%in% returns a logical vector showing whether there is a match or not.
top10 %in% Mstart : returns TRUE or FALSE for each name in top10, depending on whether it appears in Mstart
top10[top10 %in% Mstart] : subsets the vector top10 to only include those values that are TRUE in top10 %in% Mstart
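To make the link with match() concrete, here is a short sketch using the same two vectors. The name "Olivia" is made up to show what happens when there is no match.

```r
top10 <- c("Noah", "Liam", "Mason", "Jacob", "William",
           "Ethan", "James", "Alexander", "Michael", "Benjamin")
Mstart <- c("Mason", "Michael")

# match() gives the position in top10 of each name in Mstart
match(Mstart, top10)          # 3 9

# %in% asks the reverse question: is each name in top10 found in Mstart?
which(top10 %in% Mstart)      # 3 9

# A name that is not present gives NA from match() and FALSE from %in%
match("Olivia", top10)        # NA
"Olivia" %in% top10           # FALSE
```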
Data Types
How do we check what class or type of data it is?
ex1 <- c(1, 2, "Fizz", 3, "Buzz") # character
ex2 <- c(7, 235, 34.5, TRUE) # numeric
ex3 <- c("work", FALSE, "if") # character
Since a vector can only contain one single data type, R coerces the data into a single type for the vector.
There is an order to this, which is :
character > double > integer > logical
Or put another way, R will make a choice:
Character + Anything_Else -> Character
Number + Integer or Logical -> Number
Integer + Logical -> Integer
If there is any character in the vector, the entire vector will become a character data type.
This is also true in data frames, where each individual column can only contain a single data type.
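The coercion order can be checked directly with class(). The vectors below are small made-up examples; the last lines show what happens when text that is not a number is converted back to numeric.

```r
# Integer + logical -> integer (TRUE becomes 1)
class(c(1L, TRUE))        # "integer"

# Double + integer -> numeric
class(c(1.5, 2L))         # "numeric"

# Anything + character -> character
class(c(1.5, "a"))        # "character"

# Converting back: text that is not a number becomes NA.
# R also gives a warning, which is silenced here to keep the output short.
converted <- suppressWarnings(as.numeric(c("1", "2", "Fizz")))
converted                 # 1 2 NA
```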
Factors represent categorical data, and can be ordered or unordered. Factors are stored as integers but have labels associated with them that are unique to each integer.
Use the function factor() to convert a vector into a factor.
sex_vector <- c("Male", "Female", "Female", "Male", "Male")
factor_sex_vector <- factor(sex_vector)
factor_sex_vector
## [1] Male Female Female Male Male
## Levels: Female Male
summary(factor_sex_vector)
## Female Male
## 2 3
What is the summary function doing?
We see there are two levels: Female, Male. If the factor is unordered, R will place the levels in alphabetical order, NOT in the order they first occur in the vector. R will assign 1 to Female and 2 to Male.
Factors can be ordered. This aligns with ordinal data, which is categorical but has a set order to it.
age_vector <- c("35-50", "51-69", "35-50", "18-34", "51-69", "18-34", "35-50", "18-34", "51-69", "18-34","35-50", "35-50")
factor_age_vector <- factor(age_vector, ordered = TRUE, levels = c("Under 18", "18-34", "35-50", "51-69", "Over 70"))
factor_age_vector
## [1] 35-50 51-69 35-50 18-34 51-69 18-34 35-50 18-34 51-69 18-34 35-50
## [12] 35-50
## Levels: Under 18 < 18-34 < 35-50 < 51-69 < Over 70
summary(factor_age_vector)
## Under 18 18-34 35-50 51-69 Over 70
## 0 4 5 3 0
R will assign 1 to “Under 18”, 2 to “18-34”, 3 to “35-50”, 4 to “51-69”, and 5 to “Over 70”.
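The integer codes and the ordering can be inspected directly. The sketch below rebuilds the same factor so that it runs on its own; as.integer() and the comparisons are additions for illustration.

```r
age_vector <- c("35-50", "51-69", "35-50", "18-34", "51-69", "18-34",
                "35-50", "18-34", "51-69", "18-34", "35-50", "35-50")
factor_age_vector <- factor(age_vector, ordered = TRUE,
                            levels = c("Under 18", "18-34", "35-50", "51-69", "Over 70"))

# The integer code stored behind each label
age_codes <- as.integer(factor_age_vector)
age_codes                                     # 3 4 3 2 4 2 3 2 4 2 3 3

# Because the factor is ordered, comparisons make sense
factor_age_vector[1] > factor_age_vector[4]   # TRUE: "35-50" is above "18-34"

# Count how many people are in the 35-50 group or older
sum(factor_age_vector >= "35-50")             # 8
```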
Factors
movie_ratings <- c("bad", "okay", "okay", "good", "amazing", "horrible", "bad", "amazing")
Which one do you think is more informative?
A data frame is a two-dimensional object that includes both rows and columns. Columns are variables and rows are the observations. Each column can have a different data type.
data.frame() is used to create a data frame.
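A minimal sketch of building a data frame by hand is shown below. The column names and values are made up for illustration; stringsAsFactors = FALSE is set explicitly so the text column stays as character in every version of R.

```r
# Each argument becomes a column; all columns must have the same length
students <- data.frame(
  name = c("Harry Potter", "Ron Weasley", "Hermione Granger"),
  age = c(11, 11, 12),            # made-up values
  prefect = c(FALSE, TRUE, TRUE), # made-up values
  stringsAsFactors = FALSE
)

nrow(students)   # 3 observations
ncol(students)   # 3 variables

# Each column keeps its own data type
column_types <- sapply(students, class)
column_types     # name: "character", age: "numeric", prefect: "logical"
```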
We will use “iris”, a data set in data.frame format that comes preloaded in R. For more information about this dataset, see here.
Subsetting, or extracting, individual elements from a data frame occurs by using square bracket notation [ ], similar to how we indexed vectors. HOWEVER, this time we need to tell R both the row and the column that we want, for example:
dataframe[row_number, column_number]
If we want all rows from one column, we would run something like this:
dataframe[ , column_number]
dataframe[, "column_name"]
dataframe$column_name
NOTICE: even though we don’t have ANY row_number, we still need to have a comma (,) to indicate which element we are giving to R (row or columns). The blank space before the comma tells R we want ALL rows.
We will learn a different way to work with dataframes using dplyr, but it is good to know how other people may construct their code.
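Rows and columns can also be given together. The sketch below uses iris; the row numbers and the setosa filter are arbitrary choices for illustration.

```r
# Row 1, column 1: a single value
iris[1, 1]                      # 5.1

# Rows 1 to 3, two columns selected by name
small_piece <- iris[1:3, c("Sepal.Length", "Species")]
small_piece

# A logical condition in the row position keeps only the matching rows
setosa_rows <- iris[iris$Species == "setosa", ]
nrow(setosa_rows)               # 50
```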
Can you guess what these will return?
iris[,2]
iris[4,]
iris[, c(1,3)]
iris[,"Species"]
iris$Species
Comments
R ignores everything after a #.
Use these to make comments in the code.
Comments can come after the command on the same line as well.
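Both placements look like this in practice; the variable name total is made up for this small sketch.

```r
# This whole line is a comment and is ignored by R
total <- 2 + 2   # a comment can also follow code on the same line
total            # 4
```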